
feat(python): expose rust writer as additional engine #1872

Conversation

@ion-elgreco (Collaborator) commented Nov 16, 2023

Description

First version of a functional Python binding to the Rust writer. Lots of work is happening around the writer; #1820 should be merged first.

A couple of gaps will exist between the current Rust writer and the pyarrow writer. These will have to be solved in a later PR:

  • Overwrite schema in Rust
  • Replacewhere (partition filter / predicate) overwrite

Related Issue(s)

@github-actions bot added the binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate), and crate/core labels on Nov 16, 2023
@ion-elgreco (Collaborator, Author) commented Nov 16, 2023

@wjones127 @roeap Any early feedback on how to tackle the missing features and whether I am overlooking something would be much appreciated 😄.

@roeap (Collaborator) left a comment

Certainly :).

Overall I believe this is going in the right direction. One thing that may be a larger piece of work is not collecting the iterator, but rather wrapping it in an execution plan. Maybe we could also live with collect for now to get the wiring right, and leave the execution plan to a follow-up.

configuration: Option<HashMap<String, Option<String>>>,
storage_options: Option<HashMap<String, String>>,
) -> PyResult<()> {
    let batches = data.0.map(|batch| batch.unwrap()).collect::<Vec<_>>();
Collaborator

We should probably try not to fully materialize the RBR (RecordBatchReader) into memory. This would entail wrapping the RBR in an ExecutionPlan, which can be passed to the CreateBuilder. Maybe DatasetExec from datafusion-python can provide some inspiration.

Collaborator Author

Maybe we can do that in a second PR then, since I was also doing it like this for the MERGE Python binding. Then I can improve these two at the same time.
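
Even with collect kept for now, the unwrap in the snippet above could propagate errors to Python instead of panicking. A minimal sketch, assuming data.0 is an iterator of Result<RecordBatch, ArrowError> (as with an ArrowArrayStreamReader) and pyo3's PyValueError:

use pyo3::exceptions::PyValueError;

// Turn each Arrow read error into a Python ValueError rather than panicking.
let batches = data
    .0
    .map(|batch| batch.map_err(|err| PyValueError::new_err(err.to_string())))
    .collect::<Result<Vec<_>, _>>()?;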

python/src/lib.rs (outdated, resolved)
@@ -108,6 +108,8 @@ pub struct WriteBuilder {
write_batch_size: Option<usize>,
/// RecordBatches to be written into the table
batches: Option<Vec<RecordBatch>>,
/// whether to overwrite the schema
overwrite_schema: bool,
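
For context, the matching builder setter would presumably follow the same pattern as the other WriteBuilder options; a hypothetical sketch, not necessarily the PR's exact code:

impl WriteBuilder {
    /// Whether to overwrite the table schema with the schema of the incoming data
    pub fn with_overwrite_schema(mut self, overwrite_schema: bool) -> Self {
        self.overwrite_schema = overwrite_schema;
        self
    }
}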
Collaborator

Is this meant for schema evolution? If so, I'd recommend moving that to a follow-up PR, as it would likely blow up this PR quite a bit.

Collaborator

+1. I think it's fine if we let that return NotImplementedError for now.

Collaborator Author

It was a quick attempt at schema evolution. I was able to write, except it didn't write the columns that were not part of the original schema, so I need to dig through the code more.

Ok, let's do this as an improvement in another update.

Collaborator

We would also need to update all read paths to always add null columns for columns that don't exist in older parquet files. I haven't looked into it, but this would likely require some larger refactoring, particularly in the DataFusion DeltaScan. That said, we likely also need to validate that added columns are always nullable.
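
To illustrate the read-path change described above, here is a rough sketch of padding an old file's batch with null columns (assuming the arrow crate; pad_to_schema is a hypothetical helper, not existing delta-rs code):

use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: make a batch read from an older parquet file match an
/// evolved table schema by filling the missing columns with nulls.
fn pad_to_schema(batch: &RecordBatch, table_schema: &SchemaRef) -> RecordBatch {
    let columns: Vec<ArrayRef> = table_schema
        .fields()
        .iter()
        .map(|field| match batch.column_by_name(field.name()) {
            Some(col) => col.clone(),
            // Only sound if the added column is nullable, hence the validation point above.
            None => new_null_array(field.data_type(), batch.num_rows()),
        })
        .collect();
    RecordBatch::try_new(table_schema.clone(), columns)
        .expect("columns were built to match the target schema")
}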

Collaborator Author

I think that would be schema evolution for appends as well.

The PyArrow writer can do schema evolution, but only combined with an overwrite mode.

I think that's purely a metadata action then. Would this be doable with the existing deltalake-core crate?

Collaborator

Not sure; we may end up with unreadable tables if we do this... if we replace the whole table, this might work.

Collaborator Author

With PyArrow it only works together with overwrite, so it should be safe. Is there a way to adjust the commit that's written during a write?

@ion-elgreco force-pushed the feat/expose_rust_writer_as_optional_engine branch from 2932609 to 4f8f359 on November 17, 2023 08:01
@ion-elgreco force-pushed the feat/expose_rust_writer_as_optional_engine branch from dd0b4da to 9911574 on November 18, 2023 14:38
@ion-elgreco marked this pull request as ready for review on November 18, 2023 14:49
@ion-elgreco (Collaborator, Author)

@roeap @wjones127 it's ready for a full review now. That one failing test will be solved when we merge PR #1820. I've also updated the PR description; basically, we have two gaps between the pyarrow and Rust writers.

@roeap (Collaborator) commented Nov 18, 2023

While looking through this I realized we may have a bug with the schema overwrite right now.

When we added this, overwrite would always replace the entire table. Since then we have added selectively overwriting partitions, in which case I believe we may be corrupting the table, or at least making it unreadable to our readers, since we end up with parquet files with different schemas in the table.

Not sure if the pyarrow Dataset can handle that, but I am almost certain our Rust readers would not be able to.

Should we include an additional check that no partition filters are supplied if the schema is overwritten?

@ion-elgreco (Collaborator, Author)

> Should we include an additional check that no partition filters are supplied if the schema is overwritten?

For the Rust writer? Because currently only pyarrow is using them.

@roeap (Collaborator) commented Nov 18, 2023

> For the Rust writer? Because currently only pyarrow is using them.

Yes, and the pyarrow writer may be creating corrupted tables that we cannot read from Rust, and maybe even from Python, if partition filters are supplied and the schema is updated. Essentially we may need to handle very different data for some of the files we read, as we make no checks on what the schema change looks like.
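
The proposed check could be a simple guard in the write operation; a hedged sketch (overwrite_schema and partition_filters stand in for the builder's state, and DeltaTableError::Generic is used for illustration):

// Reject a schema overwrite when only a subset of partitions is being
// replaced, since mixing file schemas would leave the table unreadable.
if overwrite_schema && !partition_filters.is_empty() {
    return Err(DeltaTableError::Generic(
        "schema overwrite requires replacing the whole table; remove the partition filters".to_string(),
    ));
}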

roeap and others added 8 commits November 20, 2023 21:41
Exposes the `convert to delta` functionality added by @junjunjd to the Python API.

- closes delta-io#1767

---------

Co-authored-by: Robert Pack <[email protected]>
# Description
This refactors the merge operation to use DataFusion's DataFrame and LogicalPlan APIs.

The nested loop join (NLJ) is eliminated and the query planner can pick the optimal join operator. This also enables the operation to use multiple threads and should result in a significant speed-up.

Merge is still limited to using a single thread in some areas. When collecting benchmarks, I encountered multiple OoM issues with DataFusion's hash join implementation. There are multiple tickets open upstream regarding this. For now, I've limited the number of partitions to just 1 to prevent this; a sketch of that cap follows.
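
A sketch of how that cap might look with DataFusion's SessionConfig (an assumption about the wiring, not the PR's exact code):

use datafusion::prelude::{SessionConfig, SessionContext};

// Cap DataFusion at a single target partition to sidestep the
// hash-join OoM issues described above.
let config = SessionConfig::new().with_target_partitions(1);
let ctx = SessionContext::new_with_config(config);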

Predicates passed as SQL are also easier to use now. Previously, manual casting was required to ensure data types were aligned; now the logical plan will perform type coercion when optimizing the plan.

# Related Issues
- enhances delta-io#850
- closes delta-io#1790 
- closes delta-io#1753
# Description
Implements benchmarks that are similar to Spark's Delta benchmarks.

This enables us to have a standard benchmark to measure improvements to merge, and some pieces can be factored out to build a framework for benchmarking delta workflows.
@ion-elgreco (Collaborator, Author)

Closing this in favour of: #1891

Labels
binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate), crate/core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Rust-backed engine for write_deltalake
4 participants